Further Analysis

Research Question 1: Usage Context and Platform Patterns

Sentiment Analysis of User-Selected Content

To further address our first research question, “What are the common contexts and platforms where users engage with Jargon?”, we performed sentiment analysis on the original English sentences that users selected for learning. Using the syuzhet package in R, we assigned each sentence a sentiment score: positive values indicate positive sentiment, negative values indicate negative sentiment, and values near zero indicate neutral sentiment. This approach quantifies the emotional tone of the content users choose to engage with.
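As a rough illustration of how dictionary-based sentence scoring works, the sketch below sums per-word valences for each sentence. It is written in Python rather than R, and the lexicon, the function names, and the example sentences are all hypothetical; syuzhet uses its own, much larger dictionaries, so the numbers here are not comparable to the scores in this report.

```python
import re

# Hypothetical toy lexicon, for illustration only.
# syuzhet ships its own, far larger sentiment dictionaries.
TOY_LEXICON = {
    "great": 1.0, "love": 0.75, "helpful": 0.5,
    "bad": -0.75, "terrible": -1.0, "confusing": -0.5,
}


def sentence_sentiment(sentence, lexicon=TOY_LEXICON):
    """Sum the valence of every lexicon word found in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return sum(lexicon.get(token, 0.0) for token in tokens)


def sentiment_label(score, tolerance=1e-9):
    """Map a numeric score to positive / negative / neutral."""
    if score > tolerance:
        return "positive"
    if score < -tolerance:
        return "negative"
    return "neutral"


# Made-up example sentences, not taken from the study data.
examples = [
    "This tool is great and helpful",
    "A terrible, confusing page",
    "The meeting is at noon",
]
scores = [sentence_sentiment(s) for s in examples]
labels = [sentiment_label(x) for x in scores]
```

Sentences containing no lexicon words score exactly zero, which is why values near zero are read as neutral.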

Topic Modeling of User-Selected Content (LDA)

To further explore the contexts in which users engage with Jargon, we applied Latent Dirichlet Allocation (LDA) topic modeling to the original English sentences selected by users. In addition to standard stopwords, we removed a custom list of common or uninformative words to improve topic quality. This method uncovers the main themes present in the content users choose to learn from.
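The preprocessing step described above (standard stopwords plus a custom removal list) can be sketched as follows. This is a Python illustration only: both word lists are short hypothetical stand-ins, the minimum token length is an assumption, and the output is simply the per-sentence term counts that a topic model such as LDA would take as input. The LDA fit itself is not shown.

```python
import re
from collections import Counter

# Hypothetical, heavily abbreviated word lists; the report does not
# enumerate the stopwords or the custom removal list it used.
STANDARD_STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for", "on", "with"}
CUSTOM_STOPWORDS = {"also", "new", "said"}


def tokenize(sentence, stopwords, min_length=3):
    """Lowercase, keep alphabetic tokens, drop stopwords and very short words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_length]


def document_term_counts(sentences):
    """Return one bag-of-words Counter per sentence (the input to a topic model)."""
    stopwords = STANDARD_STOPWORDS | CUSTOM_STOPWORDS
    return [Counter(tokenize(s, stopwords)) for s in sentences]


# Made-up sentences, not taken from the study data.
docs = document_term_counts([
    "The company said it also plans a new product launch",
    "Researchers study the climate in the Arctic",
])
```

Removing uninformative words before fitting matters because otherwise very frequent terms tend to dominate every topic and make the topics hard to tell apart.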

Research Question 2: Feature Adoption and User Success

Correlation Analysis

To explore the relationships between user features and engagement metrics, we computed a correlation matrix for key variables in the enhanced_profiles dataset. This helps identify which features are associated with higher engagement or other usage patterns.
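For readers unfamiliar with the computation, a correlation matrix is the pairwise Pearson correlation between every pair of variables. The Python sketch below shows the idea on made-up values; the column names are assumptions modeled on the engagement metrics named in this report, and the numbers are not drawn from enhanced_profiles.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of columns a and b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for five hypothetical users.
toy_profiles = {
    "generated_questions": [2, 5, 12, 40, 300],
    "blocked_sites": [0, 0, 1, 1, 3],
    "levels_attempted": [1, 1, 1, 2, 5],
}
matrix = correlation_matrix(toy_profiles)
```

With heavily skewed engagement data, a few extreme users can dominate Pearson coefficients, so a rank-based alternative such as Spearman correlation may be worth comparing.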

K-Means Clustering to Define Active vs. Occasional Users

To provide a data-driven segmentation of user engagement, we applied k-means clustering to the key engagement metrics: generated questions, answered questions, blocked sites, and levels attempted. This approach groups users into clusters based on their overall activity patterns, rather than relying on arbitrary thresholds or quantiles.
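The clustering step can be sketched as below, in Python and on made-up data. Several choices here are assumptions rather than details from the report: the metrics are z-scored before clustering, the initial centroids are picked deterministically (first point, then the farthest point), and the cluster with the higher mean of generated questions is the one named Active.

```python
from math import sqrt


def zscore_columns(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(c) / len(c) for c in columns]
    sds = [
        sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0  # guard constant columns
        for c, m in zip(columns, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k=2, max_iter=100):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels


def name_clusters(labels, rows, metric_index=0):
    """Call the cluster with the higher mean on one metric 'Active'."""
    means = {}
    for j in set(labels):
        values = [r[metric_index] for r, lab in zip(rows, labels) if lab == j]
        means[j] = sum(values) / len(values)
    active = max(means, key=means.get)
    return ["Active" if lab == active else "Occasional" for lab in labels]


# Made-up users: (generated, answered, blocked sites, levels attempted).
toy_users = [
    [3, 3, 0, 1], [10, 8, 0, 1], [5, 5, 1, 1],
    [20, 18, 0, 1], [310, 300, 3, 5], [340, 330, 2, 4],
]
cluster_labels = kmeans(zscore_columns(toy_users), k=2)
groups = name_clusters(cluster_labels, toy_users)
```

Standardizing first keeps a large-range metric such as generated questions from overwhelming small-range ones such as blocked sites; without it, the clusters would be driven almost entirely by question counts.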

Figure 11: K-means clustering of users based on engagement metrics. Each point represents a user, colored by cluster (Active or Occasional) in PCA space.

The PCA plot of the k-means clustering reveals a clear separation between two user groups: ‘Active’ and ‘Occasional’. The majority of users are clustered closely together in the ‘Occasional’ group, indicating similar and relatively low engagement across key metrics. In contrast, only a few users are classified as ‘Active’, and these are well separated from the main cluster, highlighting their much higher engagement levels. This pattern suggests that while most users interact with the platform at a modest level, a small subset of users is highly engaged and drives much of the activity. Such a pattern is typical of many online platforms, where a minority of users contributes disproportionately to overall engagement. This finding aligns with Figure 10 from our Data Exploration.

Summary Statistics by K-Means Group
kmeans_group    n   mean_generated_questions   mean_answered_questions   mean_blocked_sites   mean_levels_attempted
Active          4                     325.25                    325.25                 2.75                    4.50
Occasional     88                      12.97                     12.97                 0.18                    1.11
Table 8: Summary statistics for k-means-defined active and occasional users

The summary statistics in Table 8 show a sharp contrast in engagement: the 4 ‘Active’ users identified by k-means average 325.25 generated questions, compared with 12.97 for the 88 ‘Occasional’ users, and they also score higher on blocked sites and levels attempted. Because k-means isolated only a handful of highly engaged users, the two-cluster split places nearly all users in a single ‘Occasional’ group. This motivates a finer, three-group segmentation.

User Segmentation: Very Active, Active, and Regular Users

To provide a more nuanced segmentation, we define three user groups:

  • Very Active: Users in the k-means ‘Active’ cluster (highest engagement across all metrics)
  • Active: Among the remaining users, those in the top 25% by generated_questions
  • Regular: All other users
user_group3     n   mean_generated_questions   mean_answered_questions   mean_blocked_sites   mean_levels_attempted
Active         23                      38.83                     38.83                 0.26                    1.30
Regular        65                       3.82                      3.82                 0.15                    1.05
Very Active     4                     325.25                    325.25                 2.75                    4.50
Table 9: Summary statistics for Very Active, Active, and Regular users

The summary statistics in Table 9 show a clear gradient in engagement: ‘Very Active’ users (identified by k-means) average 325.25 generated questions, ‘Active’ users 38.83, and ‘Regular’ users 3.82. The ‘Very Active’ group is highest on every metric but contains only 4 users. Because k-means alone identified so few highly engaged users, the ‘Active’ category (top 25% by generated questions among users outside the ‘Very Active’ cluster) was added to provide a more meaningful middle group. This three-group segmentation allows for a more nuanced comparison of user behaviors.
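The three-group rule defined in this section can be sketched as follows, in Python and on made-up users. Two details are assumptions not stated in the report: the 75th percentile is computed with linear interpolation (the default in R's quantile function), and users exactly at the cutoff are counted as Active. The field names are also hypothetical.

```python
def third_quartile(values):
    """75th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = 0.75 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def segment_users(users):
    """Assign each user to 'Very Active', 'Active', or 'Regular'.

    users: list of dicts with hypothetical keys 'id', 'kmeans_group',
    and 'generated_questions'.
    """
    remaining = [u["generated_questions"] for u in users if u["kmeans_group"] != "Active"]
    cutoff = third_quartile(remaining) if remaining else float("inf")
    groups = {}
    for user in users:
        if user["kmeans_group"] == "Active":
            groups[user["id"]] = "Very Active"   # k-means 'Active' cluster
        elif user["generated_questions"] >= cutoff:  # assumption: cutoff is inclusive
            groups[user["id"]] = "Active"        # top 25% of the remaining users
        else:
            groups[user["id"]] = "Regular"
    return groups


# Made-up users, not taken from the study data.
toy_users = [
    {"id": f"u{i}", "kmeans_group": "Occasional", "generated_questions": q}
    for i, q in enumerate([1, 2, 3, 4, 5, 6, 7, 40], start=1)
]
toy_users.append({"id": "u9", "kmeans_group": "Active", "generated_questions": 320})
segments = segment_users(toy_users)
```

Computing the cutoff only over users outside the k-means ‘Active’ cluster keeps the few extreme users from inflating the threshold for everyone else.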